IO: Fix parquet read from s3 directory #33632
Conversation
Thanks for looking into this!
path.close()

parquet_ds = self.api.parquet.ParquetDataset(
    path, filesystem=get_fs_for_path(path), **kwargs
Is this filesystem=get_fs_for_path(path) needed? What happens if you just pass the path (which I assume has e.g. an s3://.. in it)?
pyarrow seems to only allow a file path, as opposed to a dir path. Removing the filesystem arg here throws:

    for path in path_or_paths:
        if not fs.isfile(path):
            raise IOError('Passed non-file path: {0}'
>                         .format(path))
E   OSError: Passed non-file path: s3://pandas-test/parquet_dir

../../../.conda/envs/pandas-dev/lib/python3.7/site-packages/pyarrow/parquet.py:1229: OSError
To repro, see the test case test_s3_roundtrip_for_dir I wrote below.
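For context, the roundtrip that test exercises looks roughly like this (a sketch; the bucket/prefix is hypothetical and s3fs plus write access are assumed):

import pandas as pd

# Hypothetical bucket/prefix used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# partition_cols makes to_parquet write a directory of files under the prefix.
df.to_parquet("s3://pandas-test/parquet_dir", partition_cols=["a"])

# With this PR, read_parquet can read that directory back as one DataFrame.
result = pd.read_parquet("s3://pandas-test/parquet_dir")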
Ah, OK. I see now in pyarrow that apparently string URIs with "s3://..." are not supported (while "hdfs://" is supported). That's something we should fix on the pyarrow side as well. But of course until then this is fine.
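For reference, a minimal sketch of how the explicit filesystem works around that (assumes s3fs and pyarrow are installed; same hypothetical bucket/prefix as above):

import s3fs
import pyarrow.parquet as pq

# s3fs accepts the key with or without the "s3://" prefix.
fs = s3fs.S3FileSystem()

# With an explicit filesystem, ParquetDataset can treat the prefix as a
# directory instead of rejecting the string as a non-file path.
dataset = pq.ParquetDataset("s3://pandas-test/parquet_dir", filesystem=fs)
df = dataset.read().to_pandas()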
can you rebase
pandas/io/parquet.py (Outdated)

@@ -92,8 +97,7 @@ def write(
         **kwargs,
     ):
         self.validate_dataframe(df)
-        path, _, _, _ = get_filepath_or_buffer(path, mode="wb")
-
+        file_obj, _, _, _ = get_filepath_or_buffer(path, mode="wb")
@jorisvandenbossche I think we can clean up the write method here to get rid of get_filepath_or_buffer, similar to what I've done below for read. Will address in a different PR.
sure merged + green
pandas/io/parquet.py (Outdated)

@@ -92,7 +97,7 @@ def write(
         **kwargs,
     ):
         self.validate_dataframe(df)
-        path, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
+        file_obj, _, _, should_close = get_filepath_or_buffer(path, mode="wb")
You didn't change path to file_obj in the "if partition_cols is not None:" block. Was that on purpose?
That was indeed on purpose: write_to_dataset doesn't support a file-like object when the path is a directory.

import pyarrow.parquet
import pandas as pd
from pandas.io.common import get_filepath_or_buffer

path = "s3://pandas-test/dev"
file_obj, _, _, _ = get_filepath_or_buffer(path)
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
table = pyarrow.Table.from_pandas(df)

# Works
pyarrow.parquet.write_to_dataset(table, path, partition_cols=["a"])

# Throws AttributeError: 'NoneType' object has no attribute '_isfilestore'
pyarrow.parquet.write_to_dataset(table, file_obj, partition_cols=["a"])
OK, can you add some clarifying comments for it?
Sure done!
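Roughly the shape of the clarifying comment (a sketch; the exact wording in the PR may differ):

# write_to_dataset does not accept a file-like object when writing a
# partitioned dataset (a directory of files), so keep the original path here.
if partition_cols is not None:
    self.api.parquet.write_to_dataset(
        table, path, partition_cols=partition_cols, **kwargs
    )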
Is this always a file_obj, never a path? E.g. should we rename it to filepath_or_buffer?
Good point, I've renamed it to file_obj_or_path, since a path str is returned when a local path is passed in.
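To illustrate the two return shapes (paths are hypothetical; the s3 case needs s3fs, valid credentials and an existing object):

from pandas.io.common import get_filepath_or_buffer

# A local path comes back as a plain path str.
obj, _, _, should_close = get_filepath_or_buffer("/tmp/local.parquet")
print(type(obj))  # <class 'str'>

# An s3 path comes back as an open file-like object provided by s3fs.
obj, _, _, should_close = get_filepath_or_buffer("s3://pandas-test/file.parquet")
print(hasattr(obj, "read"))  # True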
Add clarifying comment
Hello @alimcmaster1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-04-26 20:58:49 UTC
pandas/io/common.py (Outdated)

@@ -150,6 +150,23 @@ def urlopen(*args, **kwargs):
     return urllib.request.urlopen(*args, **kwargs)


+def get_fs_for_path(filepath):
can you type this (and the return annotation)
I left the return type off for now since it includes optional dependencies, e.g. Union[s3fs.S3FileSystem, gcsfs.GCSFileSystem, None]. I can add the imports to the TYPE_CHECKING block at the top if that's appropriate?
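Something like this sketch would work if we do want the annotation (s3fs/gcsfs stay behind the TYPE_CHECKING guard, so the annotations are strings):

from typing import TYPE_CHECKING, Optional, Union

if TYPE_CHECKING:
    import gcsfs
    import s3fs


def get_fs_for_path(
    filepath: str,
) -> Optional[Union["s3fs.S3FileSystem", "gcsfs.GCSFileSystem"]]:
    # Body omitted; only the signature is sketched here.
    ...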
pandas/io/common.py (Outdated)

def get_fs_for_path(filepath):
    """
    Get appropriate filesystem given a filepath.
    Support s3fs, gcs and local disk fs
Can you make this a full doc-string with Parameters / Returns?
Sure done :)
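Roughly what the expanded docstring looks like (numpydoc style; a sketch rather than the exact merged text):

def get_fs_for_path(filepath: str):
    """
    Get the appropriate filesystem given a filepath.

    Supports s3fs, gcs and local file system.

    Parameters
    ----------
    filepath : str
        File path, e.g. "s3://bucket/key" or a local path.

    Returns
    -------
    s3fs.S3FileSystem, gcsfs.GCSFileSystem or None
        Filesystem object matching the path; None for a local path.
    """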
doc-string comment + need to merge master
doc/source/whatsnew/v1.1.0.rst (Outdated)

@@ -585,6 +585,8 @@ I/O
   unsupported HDF file (:issue:`9539`)
 - Bug in :meth:`~DataFrame.to_parquet` was not raising ``PermissionError`` when writing to a private s3 bucket with invalid creds. (:issue:`27679`)
 - Bug in :meth:`~DataFrame.to_csv` was silently failing when writing to an invalid s3 bucket. (:issue:`32486`)
+- :func:`read_parquet` now supports an s3 directory (:issue:`26388`)
can you review the doc-strings to see if they need updating (e.g. may need a versionadded tag)
The parquet docstrings indicate we already supported this, I think? I updated the whatsnew and added an example to the docstrings.
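The added docstring example is along these lines (hypothetical bucket; requires s3fs):

import pandas as pd

# Read a partitioned parquet dataset stored under an s3 prefix.
df = pd.read_parquet("s3://pandas-test/parquet_dir")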
thanks @alimcmaster1
Co-authored-by: alimcmaster1 <[email protected]>
if should_close:
    path.close()

parquet_ds = self.api.parquet.ParquetDataset(
This change breaks clients that pass a file-like object for path: ParquetDataset doesn't provide the same file-like object handling that the original get_filepath_or_buffer did.

Here's the call stack I'm seeing:

.tox/test/lib/python3.7/site-packages/pandas/io/parquet.py:315: in read_parquet
    return impl.read(path, columns=columns, **kwargs)
.tox/test/lib/python3.7/site-packages/pandas/io/parquet.py:131: in read
    path, filesystem=get_fs_for_path(path), **kwargs
.tox/test/lib/python3.7/site-packages/pyarrow/parquet.py:1162: in __init__
    self.paths = _parse_uri(path_or_paths)
.tox/test/lib/python3.7/site-packages/pyarrow/parquet.py:47: in _parse_uri
    path = _stringify_path(path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
I filed bug report #34467
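For reference, a minimal sketch of the kind of call the report covers (no s3 involved, just an in-memory file-like object; assumes the pyarrow engine):

import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

# On builds with this change, the buffer is handed straight to
# ParquetDataset, which tries to stringify it as a path and fails as above.
pd.read_parquet(buf)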
Also skip for it_IT locale
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
(Seems to have also fixed the xfailing test in #33077)
NOTE: let's merge #33645 first, since that fixes up a crucial bit of error handling around this functionality.